The tutorial is divided into two parts:
In the first part, we will learn how to create basic plots with plotly, how to create subplots and multiple axes, and how to save our plots as a portable HTML file.
In the second part, the attendees will split into 2-3 groups and work on new data, trying to create a visualisation for it. Each group will need to save its figure as an HTML file and send it to me so we can share what they did!
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly
import pandas as pd
plotly.offline.init_notebook_mode()
A scatter plot is the simplest plot we have: the data are shown as single points on the x-y plane.
Reference: https://plotly.com/python/line-and-scatter/
We use the numpy package to generate random integers for both the x and y axes.
random_X = np.random.randint(1, 100, 1000)
random_y = np.random.randint(1, 100, 1000)
We are creating our first and simplest scatter plot.
The only compulsory component we need to provide to create a plot is the data component. What we do is:
1) create the data component with go.Scatter(), where we specify the data for the x and y axes and the mode (here 'markers')
2) create the figure with go.Figure(), where we provide the components of the plot (such as the data component)
3) show the plot with fig.show()
data_component = go.Scatter(x=random_X,
y=random_y,
mode='markers')
fig = go.Figure(data=data_component)
fig.show()
Another component that we can provide to a plotly plot is the layout component, created with go.Layout(), which defines how the figure looks (for example its title and axis titles). You pass the layout component to go.Figure().
data_component = go.Scatter(x=random_X,
y=random_y,
mode='markers')
layout_component = go.Layout(title='My first scatter plot',
xaxis_title='random x',
yaxis_title='random y')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
You can change how the markers look with the marker parameter, which takes a dictionary. Some of the keys it accepts are:
Colour: You can just name the colour you want, for example 'red', or you can provide a colour code; examples can be found here: https://plotly.com/python/discrete-color/
Size: This is given as a number, for example 12.
Symbol: The shape of the markers. You can find the entire list here: https://plotly.com/python/marker-style/
Opacity: Transparency of the markers (from 0, invisible, to 1, fully visible)
Line: Border of the markers (as a dictionary, with its own width and colour)
data_component = go.Scatter(x=random_X,
y=random_y,
mode='markers',
marker=dict(size=12,
color='green',
symbol='hexagon',
opacity=0.5,
line=dict(width=2,
color='red')))
layout_component = go.Layout(title='My first scatter plot',
xaxis_title='random x',
yaxis_title='random y')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
A line plot is created in a very similar way to a scatter plot: the only difference is what you specify as the mode, which becomes mode='lines'.
Reference: https://plotly.com/python/line-charts/
The data we will visualise in the line charts are daily average temperatures in London from May 2019 to May 2020. I have uploaded them to my GitHub so you can easily load them straight into the notebook with pandas, as below. The data are stored in a pandas DataFrame (df).
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_london.csv'
df = pd.read_csv(data_source)
df
We will compare the 3 different modes available: markers (scatter plot), markers+lines (scatter with lines), and lines (line plot). I have added 15 and 30 °C to the temperatures passed on the y axis for two of the traces, just so you can see the difference between the modes (otherwise the traces would lie on top of each other).
trace_markers = go.Scatter(x=df['date'],
y=df['tavg'],
mode='markers',
name='markers')
trace_lines = go.Scatter(x=df['date'],
y=df['tavg'] + 15,
mode='lines',
name='lines - added 15 °C')
trace_markers_lines = go.Scatter(x=df['date'],
y=df['tavg'] + 30,
mode='markers+lines',
name='markers+lines - added 30 °C')
data_component = [trace_markers, trace_lines, trace_markers_lines]
layout_component = go.Layout(title='Comparison of different modes: markers, lines and markers+lines',
xaxis_title='Date',
yaxis_title='Daily average temperature (°C)',
hovermode='x')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
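A bar chart can come in 3 different types: simple, grouped (several bars side by side for each category, which is what plotly draws by default for multiple traces), and stacked.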
Reference: https://plotly.com/python/bar-charts/
The data I prepared are from the 2018 Winter Olympics. They summarise how many gold, silver and bronze medals, and how many medals in total, each country won.
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/2018WinterOlympics.csv'
df = pd.read_csv(data_source)
df
data_component = go.Bar(x=df['NOC'],
y=df['Total'])
layout_component = go.Layout(title='Medals in 2018 Olympics',
xaxis_title='Country',
yaxis_title='Total number of medals')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
trace_1 = go.Bar(x=df['NOC'],
y=df['Gold'],
name='Gold')
trace_2 = go.Bar(x=df['NOC'],
y=df['Silver'],
name='Silver')
trace_3 = go.Bar(x=df['NOC'],
y=df['Bronze'],
name='Bronze')
data_component = [trace_1, trace_2, trace_3]
layout_component = go.Layout(title='Medals in 2018 Olympics',
xaxis_title='Country',
yaxis_title='Number of medals')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
trace_1 = go.Bar(x=df['NOC'],
y=df['Gold'],
name='Gold',
marker=dict(color='gold'))
trace_2 = go.Bar(x=df['NOC'],
y=df['Silver'],
name='Silver',
marker=dict(color='silver'))
trace_3 = go.Bar(x=df['NOC'],
y=df['Bronze'],
name='Bronze',
marker=dict(color='brown'))
data_component = [trace_1, trace_2, trace_3]
layout_component = go.Layout(title='Medals in 2018 Olympics',
xaxis_title='Country',
yaxis_title='Number of medals',
barmode='stack')
fig = go.Figure(data=data_component,
layout=layout_component)
fig.show()
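Box plots are very important in statistical analysis. They show you how the data are distributed: the box spans the lower and upper quartiles, the line inside the box marks the median, and the whiskers extend to the lower and upper limits of the data that are not treated as outliers. Plotly can additionally mark the mean and standard deviation through the boxmean option of go.Box().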
Reference: https://plotly.com/python/box-plots/
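To make the quantities in a box plot concrete, here is a minimal sketch that computes them with numpy for a small made-up sample. The sample values and variable names are assumptions for illustration only, and the whiskers follow the common 1.5 x IQR rule; plotly may use a slightly different quartile method, so the numbers shown on a figure can differ a little from these.

```python
import numpy as np

# Hypothetical sample, small enough to check by hand; 30 is a deliberate outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box: lower quartile, median and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Fences: points further than 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('box:', q1, median, q3)
print('whiskers:', lower_whisker, upper_whisker)
print('outliers:', outliers)
```

In a plotly box plot the outliers are drawn as individual points beyond the whiskers.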
The abalone dataset is very popular in machine learning. Abalone is a type of shellfish whose age is often determined by counting the rings on the shell. Machine learning has been used to relate this age to other properties: length, diameter, height, and others. It is therefore useful to visualise the statistics of these properties.
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/abalone.csv'
df = pd.read_csv(data_source)
df
You can also add traces to an already created figure with fig.add_trace(), and then update the layout with fig.update_layout():
1) Create an empty figure with fig = go.Figure()
2) Add all the plots you want with fig.add_trace()
3) Update the layout with fig.update_layout()
4) Show the figure with fig.show()
fig = go.Figure()
fig.add_trace(go.Box(y=df['length'],
name='Length'))
fig.add_trace(go.Box(y=df['diameter'],
name='Diameter'))
fig.add_trace(go.Box(y=df['height'],
name='Height'))
fig.add_trace(go.Box(y=df['whole_weight'],
name='Whole weight'))
fig.update_layout(title='Box plots for basic properties of abalone shellfish',
xaxis_title='Property',
yaxis_title='Property value')
fig.show()
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['length'],
name='Length'))
fig.add_trace(go.Histogram(x=df['diameter'],
name='Diameter'))
fig.add_trace(go.Histogram(x=df['height'],
name='Height'))
fig.add_trace(go.Histogram(x=df['whole_weight'],
name='Whole weight'))
fig.update_layout(title='Histogram for basic properties of abalone shellfish',
xaxis_title='Bin',
yaxis_title='Property count')
fig.show()
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['length'],
name='Length'))
fig.add_trace(go.Histogram(x=df['diameter'],
name='Diameter'))
fig.add_trace(go.Histogram(x=df['height'],
name='Height'))
fig.add_trace(go.Histogram(x=df['whole_weight'],
name='Whole weight'))
fig.update_layout(title='Histogram for basic properties of abalone shellfish',
xaxis_title='Bin',
yaxis_title='Property count',
barmode='stack')
fig.show()
Heat maps are very useful for visualising how a value z depends on two variables x and y, with z shown as colour. A typical example is a table of Pearson correlation coefficients between pairs of variables.
Reference: https://plotly.com/python/heatmaps/
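As a minimal sketch of the correlation example, the code below builds a Pearson correlation table with pandas and checks one entry against the definition of the coefficient. The small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y rises with x, z falls with x
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [2, 4, 6, 8, 10],
                    'z': [5, 4, 3, 2, 1]})

# Square table of pairwise Pearson correlation coefficients
corr = toy.corr()

# The same coefficient for one pair, straight from the definition:
# covariance divided by the product of the standard deviations
dx = toy['x'] - toy['x'].mean()
dy = toy['y'] - toy['y'].mean()
r_xy = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr)
print('r(x, y) from the definition:', r_xy)
```

A table like corr is already in the x-y-z form a heat map needs: the column names go on x, the row names on y, and the coefficients (corr.values) are z.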
The next dataset we are looking at contains hourly average temperatures in Santa Barbara, California.
data_source = 'https://github.com/kamiloster/plotly_workshop/raw/main/2010SantaBarbaraCA.csv'
df = pd.read_csv(data_source)
df
fig = go.Figure()
fig.add_trace(go.Heatmap(x=df['DAY'],
y=df['LST_TIME'],
z=df['T_HR_AVG']))
fig.update_layout(title='Hourly average temperature across the week in Santa Barbara (California)',
xaxis_title='Day of the week',
yaxis_title='Hour of the day')
fig.show()
Sometimes, when we plot variables whose values differ greatly in magnitude, it makes more sense to create two y axes (or two x axes).
Reference: https://plotly.com/python/multiple-axes/
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_flow_rate.csv'
df = pd.read_csv(data_source)
df
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=df['Date'],
y=df['Temperature'],
name='Temperature'),
secondary_y=False)
fig.add_trace(go.Scatter(x=df['Date'],
y=df['Flow rate'],
name='Flow rate'),
secondary_y=True)
fig.update_layout(title_text="Plot with two axes: temperature and flow rate")
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Temperature", secondary_y=False)
fig.update_yaxes(title_text="Flow rate", secondary_y=True)
fig.show()
Subplots are very useful when we want to show different types of plots side by side in one figure.
Reference: https://plotly.com/python/subplots/
data_source = 'https://raw.githubusercontent.com/kamiloster/plotly_workshop/main/temperature_london.csv'
df = pd.read_csv(data_source)
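# numeric_only=True skips non-numeric columns such as 'date'; pandas 2.0+ raises an error without it
# (the argument is available from pandas 1.5)
correlation_map = df.corr(numeric_only=True)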
df
fig = make_subplots(rows=2,
cols=2,
vertical_spacing=0.2,
subplot_titles=('Line plot',
'Heatmap',
'Histogram',
'Box plot'))
# Line plots - row 1, column 1
fig.add_trace(go.Scatter(x=df['date'],
y=df['tmin'],
name='Minimum temperature',
mode='lines',
marker=dict(color='#B6E880')),
row=1,
col=1)
fig.add_trace(go.Scatter(x=df['date'],
y=df['tmax'],
name='Maximum temperature',
mode='lines',
marker=dict(color='#17BECF')),
row=1,
col=1)
fig.add_trace(go.Scatter(x=df['date'],
y=df['tavg'],
name='Average temperature',
mode='lines',
marker=dict(color='black')),
row=1,
col=1)
# Heatmap - row 1, column 2
fig.add_trace(go.Heatmap(z=correlation_map,
x=df.columns[1:],
y=df.columns[1:],
showscale=False),
row=1,
col=2)
# Histograms - row 2, column 1
fig.add_trace(go.Histogram(x=df['tavg'],
name='Average temperature',
marker=dict(color='black'),
showlegend=False),
row=2,
col=1)
fig.add_trace(go.Histogram(x=df['tmin'],
name='Minimum temperature',
marker=dict(color='#B6E880'),
showlegend=False),
row=2,
col=1)
fig.add_trace(go.Histogram(x=df['tmax'],
name='Maximum temperature',
marker=dict(color='#17BECF'),
showlegend=False),
row=2,
col=1)
# Box plots - row 2, column 2
fig.add_trace(go.Box(y=df['tavg'],
name='Average temperature',
marker=dict(color='black'),
showlegend=False,
boxpoints='all'),
row=2,
col=2)
fig.add_trace(go.Box(y=df['tmin'],
name='Minimum temperature',
marker=dict(color='#B6E880'),
showlegend=False,
boxpoints='all'),
row=2,
col=2)
fig.add_trace(go.Box(y=df['tmax'],
name='Maximum temperature',
marker=dict(color='#17BECF'),
showlegend=False,
boxpoints='all'),
row=2,
col=2)
fig.update_layout(legend=dict(y=1.3, x=0),
barmode='stack')
fig.update_yaxes(title_text='Temperature (°C)',
row=1,
col=1)
fig.update_xaxes(title_text='Date',
showticklabels=False,
row=1,
col=1)
fig.update_yaxes(title_text='Count',
row=2,
col=1)
fig.update_xaxes(title_text='Bin',
row=2,
col=1)
fig.update_yaxes(title_text='Temperature (°C)',
row=2,
col=2)
# Save the figure as an HTML file next to the notebook; change the path if you want it elsewhere
fig.write_html('figure.html')
fig.show()